STAT 1000 - Assignment 1 


Name (Student Number) 


2023-09-28 


Instructions 


To properly view the assignment questions, knit this file to .PDF and view the output. 


To answer the R-based questions, add code as needed into the R code chunks given below the question, and, 
where applicable, replace the “Delete me; ...” with your own text response. Be sure when adding in text responses 
to never copy-paste symbols from outside of the document. Only use the symbols on your keyboard. Do 
not delete the question text, or modify any other part of the code except for the “author” in Line 3. All 
numerical and graphical answers must be done using R, unless stated otherwise. 


You will have a link in your email that takes you to the Crowdmark submission page. Once you have 
completed the worksheet, knit it to .PDF and upload your output to Crowdmark. Also, upload your .Rmd 
file to Crowdmark where prompted. To see where your .Rmd file is saved, click File > Save As in the top-left 
of your screen. Make sure you set your Name and Student Number in the Author section of this document 
(Line 3). Do not alter the title or the date. Please note that if you do not submit a knit .PDF file, you will 
be given a grade of zero on the R component of the assignment. 


After you knit your assignment to PDF, check your code chunks. If your code at any point runs off the page, 
find the nearest comma, click to the right of it, and press Enter (or Return if you are on a Mac). This will 
force a break in the code so that it goes onto the next line. All of your code must be readable in the final 
submission. 


For the R-based questions, all calculations and output must be visible in the final knit PDF, and all text 
responses should be in complete English sentences. Your work should be done using the same formatting, 
functions, and packages as in your labs and course notes, unless otherwise specified. You may speak to 
your classmates about ideas and what functions/optional arguments you may need to use, but you may not 
directly show your code/output to your classmates. 


Several questions will require you to write up your work on paper (or a tablet). Note that illegible work may 
receive a mark of zero. Use lined paper if possible, and clean up your work before you submit it. You will 
need to either scan your written work, screenshot / save to PDF if your work is done on a tablet, or take a 
picture of your work. Ensure that your submissions are rotated correctly in Crowdmark before you submit 
(any incorrectly rotated submissions may receive a grade of zero). Crowdmark will let you preview your 
submission and rotate any images before you submit. 


Each Question and each Part must be clearly separated. To this end, make sure that each Part is separated 
by at least four lines, and that each question starts on a new page. 


Whether or not it is specified, all written questions require you to show your full work and reasoning. Answers 
given without proper reasoning / work may receive a grade of zero. All calculations should be given to two 
decimal places, unless otherwise stated. 


Your full submission is due by 11:59 p.m. on Saturday, September 30. Crowdmark may allow you to submit 
late, but you will be given an automatic grade of zero if you do. If you have an issue that you can’t resolve 
without someone looking at your work (e.g., you get an error when knitting your document), please see the 
Help Centre in 311 Machray Hall. 


Questions [50 marks] 


Question 1 [5 marks] 


Part (a) [1 mark] The Company dataset contains the Employee Numbers, Ages, Yearly Earnings, Department, 
Years of Service, Seniority Rank, and Pension for a sample of nearly 3000 employees at a large 
company. Import it below. 


Company <- read.csv("~/Downloads/Company.csv") 


head(Company)

##   ENum Ages Earnings Department Years Seniority    Pension
## 1    1   41   105054      Xport  22.3        IV 326777.570
## 2    2   39    82411        Mfg  14.4       III 145752.096
## 3    3   52   147938         CS  34.0        IV 854048.685
## 4    4   25    52935         RD   0.5         I   2627.192
## 5    5   25   111363         RD   6.9        II  83983.988
## 6    6   31    69364         CS   4.7        II  34459.918


Part (b) [1 mark] As we saw in our first lab, you can make a simple histogram in R by typing hist(x), 
where x is your dataset. 


Below, make a basic histogram of the Earnings variable. 


hist(Company$Earnings)


Part (c) [2 marks] We can add more details to this graph through what are called arguments. 


The additional arguments of a function must stay within the brackets of the function, must be separated by 
commas, and must be referenced by name. For the hist function, the most common arguments are breaks, 
main, xlab, ylab, and col. To supply these arguments to hist, we would use the syntax hist(x, breaks 
= ..., main = ..., xlab = ..., ylab = ..., col = ...). In this function: 


• The vector x is the dataset we provide. 

• The argument breaks allows us to suggest a number of bins for the histogram. Note that R will often 
not give you the exact number of bins that you ask for. Instead, it will give you the nearest number of 
bins that it determines to be “pretty”. We will not go into how this calculation is made in this course. 

• The argument main allows us to give a name to the histogram. Note that, since you will be supplying 
a set of characters as your argument, this will need to be wrapped in quotation marks. That is, to set 
a title of MY NEW TITLE, you would enter main = "MY NEW TITLE". 

• The argument xlab allows us to give a name to the x-axis of the histogram. 

• The argument ylab allows us to give a name to the y-axis of the histogram. 

• The argument col allows us to set the colours of the bars of the histogram. For a list of all available 
colours in R, follow this link here. Like main, xlab, and ylab, you will have to surround the colour 
name in quotation marks. 


If you do not specify an argument, it will be left at its default value. 
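
As a minimal sketch of how named arguments are passed to hist, the example below uses simulated values rather than the Company data. The vector demo_values, the seed, the labels, and the colour are all assumptions chosen for illustration.

```r
# Hypothetical data for illustration only (not the Company dataset)
set.seed(1)
demo_values <- rexp(200, rate = 1 / 50000)

# Named arguments stay inside the brackets of hist() and are separated by commas
h <- hist(demo_values, breaks = 10, main = "Demo Histogram",
          xlab = "Value ($)", ylab = "Counts", col = "steelblue")

# hist() also returns the bins it actually used
n_bins <- length(h$counts)
n_bins
```

Because breaks is only a suggestion, n_bins may differ from the 10 bins requested.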


Below, create another histogram of the Earnings variable. Set this histogram to have 30 breaks, a main 
title of “Yearly Earnings”, an x-axis label of “Yearly Earnings ($)”, a y-axis label of “Counts”, and a colour 
of “firebrick1”. 


If your code runs off the page (you can tell where this is going to happen by looking for the thin grey vertical 
line spanning the screen in RMarkdown), make sure you break the line by pressing Enter / Return after a 
comma. 


hist(Company$Earnings, breaks = 30, main = "Yearly Earnings",
     xlab = "Yearly Earnings ($)", ylab = "Counts", col = "firebrick1")


Part (d) [1 mark] What is the shape of this dataset? Based on this, do you expect the mean to be above 
or below the median? 


This dataset is skewed to the right, so we expect the mean to be above the median. 
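
The link between right skew and the mean can be sketched on a small made-up vector. The values in demo are an assumption for illustration and are not taken from the Earnings variable.

```r
# Hypothetical right-skewed data: one large value stretches the right tail
demo <- c(1, 2, 2, 3, 3, 4, 20)

mean(demo)    # 5, pulled upward by the value 20
median(demo)  # 3, unaffected by how large the largest value is
```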


Question 2 [7 marks] 


Part (a) [1 mark] The Stores dataset contains the weekly sales for 3 branches of a chain of dollar stores 
in Winnipeg, across two years. Import it below: 


Stores <- read.csv("~/Downloads/Stores.csv") 


Part (b) [1 mark] R may be used to quickly create boxplots, using the function boxplot. To make a 
simple quantile boxplot in R, we can enter boxplot(x, range = 0), where x is the dataset that we are 
interested in. 


Below, make a basic quantile boxplot of the Store1 sales. 


boxplot(Stores$Store1, range = 0)


Like in hist, we can use the main, xlab, ylab, and col arguments as before. We can also add the argument 
horizontal = TRUE to make this into a horizontal boxplot. If we leave this argument out, it will stay as a 
vertical boxplot. 


Part (c) [2 marks] Below, create a horizontal quantile boxplot of the Store1 variable. Set this boxplot 
to have a title of “Store 1 sales”, an x-axis label of “Weekly Sales ($)”, and a colour of lightsteelblue2. 


boxplot(Stores$Store1, range = 0, main = "store 1 sales",
        xlab = "Weekly Sales ($)", col = "lightsteelblue2", horizontal = TRUE)


[Figure: horizontal quantile boxplot titled “store 1 sales”, with an x-axis labelled “Weekly Sales ($)” running from 4000 to 14000.]


Part (d) [1 mark] If we instead want to make an outlier boxplot, we can remove the argument range = 
0. Recreate the above boxplot, but make this into an outlier boxplot instead of a quantile boxplot. 
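
The effect of dropping range = 0 can be sketched on a small made-up vector. The values in demo are an assumption for illustration; the $out component used here is part of the list that boxplot returns.

```r
# Hypothetical data with one unusually large value
demo <- c(2, 3, 4, 5, 6, 7, 30)

# Quantile boxplot: whiskers extend to the minimum and maximum
quantile_box <- boxplot(demo, range = 0, horizontal = TRUE)

# Outlier boxplot: points beyond the fences are drawn individually
outlier_box <- boxplot(demo, horizontal = TRUE)

quantile_box$out  # empty, nothing is flagged
outlier_box$out   # 30
```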


boxplot(Stores$Store1, main = "store 1 sales",
        xlab = "Weekly Sales ($)", col = "lightsteelblue2", horizontal = TRUE)


Part (e) [2 marks] To make a side-by-side boxplot, we just add all the datasets in order before the other 
arguments. For example, to make a side-by-side boxplot of three datasets x, y, and z, we would enter 
boxplot(x, y, z). 


To give each boxplot a name, you can use the names argument. Note that you will need to supply a vector of 
names here. For example, to name the three boxplots in the example above, we would supply the argument 
names = c("X", "Y", "Z"). 
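
A minimal sketch of the names argument, using three made-up vectors. The vectors x_demo, y_demo, and z_demo and the labels are assumptions for illustration, not the Stores data.

```r
# Hypothetical datasets for illustration only
x_demo <- c(3, 5, 6, 8, 9)
y_demo <- c(2, 4, 4, 7, 10)
z_demo <- c(5, 6, 7, 7, 12)

# The datasets come first, then the named arguments
b <- boxplot(x_demo, y_demo, z_demo, names = c("X", "Y", "Z"),
             main = "Demo Boxplots", ylab = "Value")

# The returned list records the name given to each box
b$names
```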


Below, create a side-by-side outlier boxplot comparing the three stores’ sales. Set the main title to “Branch 
Sales”, and the y-axis label to “Weekly Sales ($)”. Use the names argument to name the boxplots “Store 1”, 
“Store 2”, etc. 


boxplot(Stores$Store1, Stores$Store2, Stores$Store3, main = "Branch Sales",
        names = c("Store 1", "Store 2", "Store 3"), ylab = "Weekly Sales ($)")


Question 3 [3 marks] 


Mulyadi is a food reviewer with a popular blog online. When he eats at a restaurant, he notes the following 
things about his meal: 


• Cost: Mulyadi records, in dollars, the total cost for his meal. 

• Service Quality: Mulyadi rates the quality of his service as either Poor, Fair, or Great. 

• Type of Food Served: Mulyadi records the type of food served, e.g., Mediterranean, Mexican, Greek, 
etc. 

• Dietary Exceptions: Mulyadi notes whether the restaurant has Vegetarian options, Vegan options, 
both, or neither. 

• Wait Time: Mulyadi records, in minutes, how long it took him to receive his food after making his 
order. 

• Overall Score: Mulyadi gives the restaurant a final score out of five stars (1 star = bad, 2 stars = 
fair, 3 stars = good, 4 stars = very good, 5 stars = excellent). 


Classify each of these variables as either categorical nominal, categorical ordinal, or quantitative. Replace 
the ANSWER HERE with your own response, but do not remove the asterisks. 


• Cost is... Quantitative 

• Service Quality is... Categorical Ordinal 

• Type of Food Served is... Categorical Nominal 

• Dietary Exceptions is... Categorical Nominal 

• Wait Time is... Quantitative 

• Overall Score is... Categorical Ordinal 
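
As a rough sketch of how these three kinds of variable might be stored in R, the example below uses factor for categorical data and a plain numeric vector for quantitative data. The values and the payment variable are hypothetical and are not part of Mulyadi's records.

```r
# Categorical ordinal: the levels have a meaningful order
service <- factor(c("Fair", "Great", "Poor", "Fair"),
                  levels = c("Poor", "Fair", "Great"), ordered = TRUE)

# Categorical nominal (hypothetical variable): labels with no natural order
payment <- factor(c("Cash", "Card", "Card", "Cash"))

# Quantitative: numbers on which arithmetic makes sense
wait_time <- c(12, 25, 8, 15)

is.ordered(service)    # TRUE
is.ordered(payment)    # FALSE
is.numeric(wait_time)  # TRUE
```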


Question 4 [13 marks] 


Below are 18 observations, each representing a wait time for customers on a technical support call line. 


1 2 6 7 9 10 11 12 14 
15 16 17 19 22 27 39 47 58 


Part (a) [1 mark] Calculate the median of this dataset. 


Do this work on a separate piece of paper. Reference the assignment instructions for 
details on how to format your work. Show all of your work for all written questions. 


Part (b) [2 marks] Calculate the five number summary of this dataset. 


Do this work on a separate piece of paper. 


Part (c) [1 mark] What would be the fences when constructing an outlier boxplot of this dataset? 


Do this work on a separate piece of paper. 


Part (d) [1 mark] Which values would be marked as suspected outliers in this dataset? 


Do this work on a separate piece of paper. 


Part (e) [3 marks] Repeat Parts (a) through (c) in R, in the code chunk below. You may calculate the 
median of a dataset by entering median(x), where x is the name of your dataset. Similarly, you can calculate 
the five number summary with the fivenum function, and the IQR with the IQR function. 


x <- c(1, 2, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 22, 27, 39, 47, 58)
fivenum(x) 


## [1] 1.0 9.0 14.5 22.0 58.0 


IQR(x) 


## [1] 12 
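
The fences and suspected outliers can be sketched from the five number summary. This assumes the usual 1.5 × IQR rule with the quartiles taken from fivenum (the same convention boxplot uses); the names wait, lower_fence, upper_fence, and suspected are made up for this example, and the hand method taught in class may differ slightly.

```r
wait <- c(1, 2, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 22, 27, 39, 47, 58)

five <- fivenum(wait)
q1 <- five[2]              # 9
q3 <- five[4]              # 22
iqr_hinge <- q3 - q1       # 13

# Assumed rule: fences sit 1.5 IQRs beyond the quartiles
lower_fence <- q1 - 1.5 * iqr_hinge   # -10.5
upper_fence <- q3 + 1.5 * iqr_hinge   # 41.5

# Observations outside the fences are suspected outliers
suspected <- wait[wait < lower_fence | wait > upper_fence]
suspected                  # 47 58
```

IQR() interpolates the quartiles by default, which is why it reports 12 above while the spread between the fivenum quartiles is 13. The same two values are flagged under either calculation.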


Part (f) [2 marks] Use R to construct a horizontal outlier boxplot for this dataset. Set an appropriate 
title for the graph, as well as the x-axis. 


boxplot(x, main = "Technical support waiting time",
        xlab = "Time", horizontal = TRUE)


Part (g) [2 marks] What is the shape of this dataset? Based on this, do you expect the mean to be less 
than, greater than, or approximately equal to the median? 


The dataset is skewed to the right; therefore, the mean will be greater than the median. 


Part (h) [1 mark] Confirm your discussion in Part (g) by calculating the mean in R. 


mean(x) 


## [1] 18.44444 


Question 5 [9 marks] 


The below dataset represents the calorie counts for the 7 items at a small, health-focused restaurant. 


410 580 410 510 670 510 480 


Part (a) [1 mark] Calculate the mean of this dataset. 


Do this work on a separate piece of paper. 


Part (b) [2 marks] Calculate the standard deviation of this dataset. 


Do this work on a separate piece of paper. 


Part (c) [2 marks] Suppose that a new observation is added to this dataset - a bowl of soup with a 
calorie count of 510. Will this cause the mean of this sample to increase, decrease, or remain the same? 
Similarly, will it cause the standard deviation of this sample to increase, decrease, or stay the same? Answer 
this question without making any calculations, and explain your reasoning in full. 


Do this work on a separate piece of paper. 


Part (d) [2 marks] Confirm your results in Part (a) and (b) by repeating the calculations in R. You can 
calculate the standard deviation of a dataset with the sd function. 


z <- c(410, 580, 410, 670, 510, 510, 480)
sd(z) 


## [1] 92.55629 


mean(z) 


## [1] 510 


Part (e) [2 marks] Confirm your results in Part (c) by recalculating the mean and standard deviation in 
R. 
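
Beyond recomputing the two statistics, a short sketch can show why they behave this way when the new observation equals the mean. The names z_demo, y_demo, ss_z, and ss_y are made up for this example.

```r
z_demo <- c(410, 580, 410, 670, 510, 510, 480)
y_demo <- c(z_demo, 510)   # add the bowl of soup

# The new value equals the old mean, so the mean does not move
mean(z_demo)   # 510
mean(y_demo)   # 510

# The sum of squared deviations is unchanged, because the new value adds 0 to it
ss_z <- sum((z_demo - mean(z_demo))^2)   # 51400
ss_y <- sum((y_demo - mean(y_demo))^2)   # 51400

# The divisor n - 1 grows from 6 to 7, so the standard deviation shrinks
sqrt(ss_z / (length(z_demo) - 1))
sqrt(ss_y / (length(y_demo) - 1))
```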


y <- c(410, 580, 410, 670, 510, 510, 480, 510)
mean(y) 


## [1] 510 


sd(y) 


## [1] 85.69047 


Question 6 [8 marks] 


Below is a frequency distribution, representing the speeds of a sample of cars and cyclists along a certain 
road. The measurements are in kilometers per hour. 


Speed       Frequency 

0 - 20      5 
20 - 40     12 
40 - 60     33 
60 - 80     31 
80 - 100    13 
100 - 120   5 


Part (a) [1 mark] What is the relative frequency of the 60 - 80 interval? 


Do this work on a separate piece of paper. 


Part (b) [1 mark] Without making any calculations, do you expect the mean to be greater than, less 
than, or approximately equal to the median? 


Do this work on a separate piece of paper. 


Part (c) [1 mark] Which interval will contain the first quartile of observations? 


Do this work on a separate piece of paper. 
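
Relative and cumulative frequencies for a table like this can be sketched in R as follows. The names freq, intervals, rel_freq, cum_freq, and q1_interval are made up for this example, and locating the first quartile at one quarter of the total count is an assumed convention.

```r
freq <- c(5, 12, 33, 31, 13, 5)
intervals <- c("0-20", "20-40", "40-60", "60-80", "80-100", "100-120")

# Relative frequency = interval count divided by the total count (99 here)
rel_freq <- freq / sum(freq)
names(rel_freq) <- intervals
round(rel_freq, 2)

# Cumulative counts show where a given position in the ordered data falls
cum_freq <- cumsum(freq)   # 5 17 50 81 94 99

# First interval whose cumulative count reaches one quarter of the data
q1_interval <- intervals[which(cum_freq >= 0.25 * sum(freq))[1]]
q1_interval                # "40-60"
```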


Part (d) [4 marks] R can also be used to create a histogram from a frequency table. However, you will 
have to specify the bars and the heights manually. 


First, you will have to create a vector containing the heights of each bar. That is, you should enter heights 
<- c(5, 12, ...). Replace the ... with the remaining heights from the frequency table (ignore the line 
break here). 


Once you have created the heights vector, you can put it into the barplot function to create the bar 
plot. You will have to add the argument xaxt = "n" to prevent the default x-axis from showing, and you 
will have to add the argument space = 0 to remove the space between the bars. That is, you should enter 
barplot(heights, xaxt = "n", space = 0). Note that you can add the same arguments as you did in 
hist to create a title, x-axis label, etc. 


Finally, you will have to create the x-axis. To do so, on the next line enter the following: axis(1, at = 
0:6, labels = c(0, 20, 40, 60, 80, 100, 120)). Don’t worry about the details of what’s happening 
here; all we’re doing is manually creating the x-axis. 
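
For the curious, the steps above can be sketched on a smaller made-up table with three intervals (0 to 10, 10 to 20, 20 to 30). The vector demo_heights and the labels are assumptions for illustration only.

```r
# Hypothetical frequencies for three intervals
demo_heights <- c(2, 5, 3)

# barplot() returns the midpoint of each bar on the x-axis
mids <- barplot(demo_heights, xaxt = "n", space = 0,
                main = "Demo Histogram", xlab = "Value", ylab = "Frequency")

# With space = 0, each bar is one unit wide, so bar edges sit at 0, 1, 2, 3
axis(1, at = 0:3, labels = c(0, 10, 20, 30))

as.vector(mids)   # 0.5 1.5 2.5
```

Since the bar edges fall on whole numbers, placing the tick marks at 0:3 lines the interval boundaries up with the edges of the bars.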


Follow the steps above to create a histogram from the given frequency table. Apply the main, xlab, and 
ylab arguments to set a meaningful title, x-axis label, and y-axis label. 


heights <- c(5, 12, 33, 31, 13, 5)
barplot(heights, xaxt = "n", space = 0, main = "speed for cars and cyclists",
        xlab = "frequency", ylab = ...)
axis(1, at = 0:6, labels = c(0, 20, 40, 60, 80, 100, 120)) 


Question 7 [5 marks] 


One product review platform, called Rainforest, allows customers to rate their purchase out of 5 stars. One 
company receives the following ratings for their phone cable, StarCharge: 


Stars Frequency 


5 71 
4 113 
3 64 
2 20 
1 56 


Part (a) [1 mark] What is the average rating for this product on Rainforest? 


Do this work on a separate piece of paper. 


Part (b) [1 mark] What is the median rating for this product on Rainforest? 


Do this work on a separate piece of paper. 


Part (c) [1 mark] On a separate review platform, called QuickShip, the same product receives a total 
of 443 reviews, with an average rating of 3.67 stars. What is the combined average review across these two 
platforms? 


Do this work on a separate piece of paper. 


Part (d) [2 marks] “Review-bombing” refers to the practice of flooding a product’s reviews with negative 
reviews in order to harm the sales or popularity of a product. Suppose it is discovered that the StarCharge 
product has been review-bombed on Rainforest, and that 9 of the one-star, as well as 19 of the two-star 
ratings, are in fact from a single user with multiple Rainforest accounts. After removing these one-star and 
two-star reviews from the dataset, what will the new average rating be on Rainforest? 


Do this work on a separate piece of paper. 
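
The written answers to this question can be checked with a short sketch in R. It assumes the average of a frequency table is the frequency-weighted mean of the star values; the variable names are made up for this example.

```r
stars <- c(5, 4, 3, 2, 1)
freq <- c(71, 113, 64, 20, 56)

# Part (a): weighted mean of the star values, about 3.38
avg_rainforest <- weighted.mean(stars, freq)

# Part (b): median of the expanded list of 324 ratings, which is 4
median_rainforest <- median(rep(stars, freq))

# Part (c): pool the rating totals from both platforms, about 3.55
combined <- (sum(stars * freq) + 443 * 3.67) / (sum(freq) + 443)

# Part (d): remove 19 two-star and 9 one-star ratings, about 3.54
freq_clean <- freq - c(0, 0, 0, 19, 9)
avg_clean <- weighted.mean(stars, freq_clean)

round(c(avg_rainforest, combined, avg_clean), 2)
```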